Abstract. Cross-referenced parallel markup for mathematics allows the
combination of both presentation and content representations while
associating the components of each. Interesting applications are enabled by
such an arrangement, such as interaction with parts of the presentation
to manipulate and query the corresponding content, and enhanced
search indexing. Although the idea of such markup is hardly new, effective
techniques for creating and manipulating it are more difficult than they
appear. Since the structures and tokens in the two formats often do not
correspond one-to-one, decisions and heuristics must be developed to
determine in which way each component refers to and is referred to by
components of the other representation. Conversion between fine- and
coarse-grained parallel markup complicates ID assignments. In this paper, we
describe the techniques developed for LaTeXML, a TeX/LaTeX to XML
converter, to create cross-referenced parallel MathML. While we do not
yet consider LaTeXML's content MathML to be useful, the current effort
is a step towards that continuing goal.


1 Introduction 

Parallel markup for mathematics provides alternative representations of a
mathematical expression, in particular both the presentation form of the
mathematics, i.e. its appearance, and the content form, i.e. its meaning or
semantics. Cross-linking between the two forms provides the connection
between them such that one can determine the meaning associated with every
visible fragment of the presentation and, conversely, the visible
manifestation of each semantic sub-expression. Thus cross-linked parallel
markup provides not only the benefits of the presentation and content forms
individually, but also supports many other applications, such as hybrid
search, where both the presentation and content can be taken into account
simultaneously, and interactive applications, where the visual representation
forms part of the user interface but supports computations based on the
content representation.

Of course, the idea of parallel markup is hardly new. The m:semantics
element has been part of the MathML specification[?] since the first version, in
1998! What seems to be missing are effective strategies for creating,
manipulating and using this markup. Fine-grained parallelism is when the smallest
sub-expressions are represented in multiple forms, whereas with coarse-grained

* The final publication is available at http://link.springer.com 



parallelism the entire expression appears in the several forms. Fine-grained
parallelism is generally easier to create initially, particularly when one
deals with complex 'transfix' notations, or wants to preserve the appearance,
but can infer the semantic intent, of each sub-expression. Coarse-grained
parallelism is often required by applications which may understand only a
single format, or are unable to disentangle the fine-grained structure. HTML5
only just barely accepts coarse-grained parallel markup, for example.
Conversion from fine to coarse grained is not inherently difficult; it can be
carried out by a suitable walk of the expression tree for each format. But
what isn't so clear is how to maintain the associations between the symbols
(or more generally, the nodes) in the two trees. Indeed, since there is
typically no one-to-one correspondence between the elements of each format,
fine-grained parallelism, by itself, doesn't guarantee a clear association
between all the symbols in the two branches.

Our context here is LaTeXML, a converter from TeX/LaTeX to XML, and thence
to web-appropriate formats such as HTML, MathML and OpenMath. Input
documents range from highly semantic markup such as sTeX[?], to intermediate
markup such as that used in DLMF[?], to fairly undisciplined, purely
presentational markup as found on arXiv[?]. TeX induces high expectations for
quality formatting, forcing us to preserve the presentation of math.
Meanwhile, the promise of global digital mathematics libraries and the
potential reuse of a legacy of mathematical material encourage us to push the
extraction of content from such documents as far as possible. At the very
least, we should preserve whatever semantics is available in order to enable
other technologies and research, such as LLaMaPUn[?], to resolve the
remaining ambiguities.

In this paper, we describe the markup used in LaTeXML both for macros with
known semantics and for the result of parsing, and strategies for conversion
to cross-linked, parallel markup combining Presentation MathML (pMML) and
Content MathML (cMML). It should be noted that this does not mean that
LaTeXML is producing cMML of useful quality; the current work is a stepping
stone towards that long-term goal.

2 Motivation 

Before diving into examples, a brief introduction to LaTeXML's internal
mathematics markup, informally called XMath, is in order. This markup,
inspired by OpenMath and both pMML and cMML, is intentionally hybrid in order
to capture both the presentation and content properties of the mathematical
objects throughout the step-wise processing from raw TeX markup, through
parsing and, ultimately, semantic annotation. Please see the online manual¹
for more details.

XMApp generalized application (think m:apply or om:OMA); 

XMTok generalized token (think m:mi, m:mo, m:mn, m:csymbol);

XMDual parallel markup container of the content and presentation branches; 

¹ http://dlmf.nist.gov/LaTeXML/manual/



Listing 1.1. Internal representation of $a + F(a,b)$, after parsing (assuming
F is a function)

<XMApp>
  <XMTok meaning="plus" role="ADDOP">+</XMTok>
  <XMTok role="ID" font="italic">a</XMTok>
  <XMDual>
    <XMApp>
      <XMRef idref="m1.1"/>
      <XMRef idref="m1.2"/>
      <XMRef idref="m1.3"/>
    </XMApp>
    <XMApp>
      <XMTok role="FUNCTION" xml:id="m1.1" font="italic">F</XMTok>
      <XMWrap>
        <XMTok role="OPEN" stretchy="false">(</XMTok>
        <XMTok role="ID" xml:id="m1.2" font="italic">a</XMTok>
        <XMTok role="PUNCT">,</XMTok>
        <XMTok role="ID" xml:id="m1.3" font="italic">b</XMTok>
        <XMTok role="CLOSE" stretchy="false">)</XMTok>
      </XMWrap>
    </XMApp>
  </XMDual>
</XMApp>


XMRef shares nodes between branches of XMDual, via xml:id and idref attributes;

XMWrap container of unparsed sequences of tokens or subtrees (think m:mrow).

By way of motivation, consider the simple example in Listing 1.1. The role
attribute on tokens indicates the syntactic role that each plays in the
grammar; in this case, we've asserted that F is a function, allowing the
expression to be parsed. At the top level, the sum requires no special
parallel treatment since the presentation for infix operators is trivially
derived from the content form (i.e. the application of '+' to its arguments).
The application of F to its arguments benefits somewhat from parallel markup.
This is a typical situation with the fine-grained XMDual: the content branch
is the application of some function or operator (here F) to arguments (here
a, b), but they are represented indirectly using XMRef to point to the
corresponding sub-expressions within the presentation. While one could
represent the delimiters and punctuation as attributes (as in MathML's
m:mfenced), that loses properties of those delimiters such as stretchiness,
size or even color. A more compelling case is made when more complex
transfix notations or semantic macros are involved, as we will shortly see.

However, this simple example already hints at a hidden complexity.
Converting to either pMML or cMML is straightforward (given rules for mapping
XMath elements to MathML): simply walk the tree, following each XMRef to the
referenced node and choosing the first or second branch of XMDual for content or
presentation, respectively. Even cross-linking is straightforward in the absence
of XMDual, when the generated content or presentation nodes are 'sourced' from
the same XMath node (F, a, and b, in the example): we simply assign IDs to the
source XMath node and the generated nodes and record the association between
the two; afterwards, the presentation and content nodes that were sourced from
the same ID get an xref attribute referring to the other, in order to connect them.
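The tree walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LaTeXML's implementation: the function name walk is hypothetical, the input is a trimmed copy of Listing 1.1, and instead of generating MathML the walk merely collects the token text seen in each branch.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed copy of Listing 1.1 (font and stretchy attributes omitted).
XMATH = """
<XMApp>
  <XMTok meaning="plus" role="ADDOP">+</XMTok>
  <XMTok role="ID">a</XMTok>
  <XMDual>
    <XMApp>
      <XMRef idref="m1.1"/>
      <XMRef idref="m1.2"/>
      <XMRef idref="m1.3"/>
    </XMApp>
    <XMApp>
      <XMTok role="FUNCTION" xml:id="m1.1">F</XMTok>
      <XMWrap>
        <XMTok role="OPEN">(</XMTok>
        <XMTok role="ID" xml:id="m1.2">a</XMTok>
        <XMTok role="PUNCT">,</XMTok>
        <XMTok role="ID" xml:id="m1.3">b</XMTok>
        <XMTok role="CLOSE">)</XMTok>
      </XMWrap>
    </XMApp>
  </XMDual>
</XMApp>
"""


def walk(node, branch, ids):
    """Hypothetical helper: yield the token text reachable in one branch."""
    if node.tag == "XMRef":
        # Follow the reference to the shared node.
        yield from walk(ids[node.get("idref")], branch, ids)
    elif node.tag == "XMDual":
        # First child is the content branch, second the presentation branch.
        content, presentation = list(node)
        chosen = content if branch == "content" else presentation
        yield from walk(chosen, branch, ids)
    elif node.tag == "XMTok":
        yield (node.text or "").strip()
    else:
        # XMApp and XMWrap: visit the children in order.
        for child in node:
            yield from walk(child, branch, ids)


root = ET.fromstring(XMATH)
ids = {n.get(XML_ID): n for n in root.iter() if n.get(XML_ID) is not None}
content_tokens = list(walk(root, "content", ids))
presentation_tokens = list(walk(root, "presentation", ids))
```

The parentheses and comma show up only in presentation_tokens, which is exactly the mismatch that complicates cross-linking once XMDual is involved.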





Listing 1.2. MathML representation of $a + F(a,b)$

<math display="block" alttext="a+F(a,b)" class="ltx_Math" id="m1">
  <semantics id="m1a">
    <mrow xref="m1.7.cmml" id="m1.7">
      <mi xref="m1.4.cmml" id="m1.4">a</mi>
      <mo xref="m1.5.cmml" id="m1.5">+</mo>
      <mrow xref="m1.6.cmml" id="m1.6d">
        <mi xref="m1.1.cmml" id="m1.1">F</mi>
        <mo xref="m1.6.cmml" id="m1.6e">&ApplyFunction;</mo>
        <mrow xref="m1.6.cmml" id="m1.6c">
          <mo xref="m1.6.cmml" id="m1.6" stretchy="false">(</mo>
          <mi xref="m1.2.cmml" id="m1.2">a</mi>
          <mo xref="m1.6.cmml" id="m1.6a">,</mo>
          <mi xref="m1.3.cmml" id="m1.3">b</mi>
          <mo xref="m1.6.cmml" id="m1.6b" stretchy="false">)</mo>
        </mrow>
      </mrow>
    </mrow>
    <annotation-xml id="m1b" encoding="MathML-Content">
      <apply xref="m1.7" id="m1.7.cmml">
        <plus xref="m1.5" id="m1.5.cmml"/>
        <ci xref="m1.4" id="m1.4.cmml">a</ci>
        <apply xref="m1.6d" id="m1.6.cmml">
          <ci xref="m1.1" id="m1.1.cmml">F</ci>
          <ci xref="m1.2" id="m1.2.cmml">a</ci>
          <ci xref="m1.3" id="m1.3.cmml">b</ci>
        </apply>
      </apply>
    </annotation-xml>
    <annotation id="m1c" encoding="application/x-tex">a+F(a,b)</annotation>
  </semantics>
</math>


But with XMDual one has not only to determine when the generated nodes are
related, but also to contend with extra tokens; in the example, the parentheses
and comma appear only in the presentation. Presumably, those tokens should
be associated with the application of F, as would the containing m:mrow. The
desired result is shown in Listing 1.2.

A fuller illustration of the issues encountered in typical markup combines
complex transfix notations and semantic macros, such as:


\left\langle\Psi\middle|\mathcal{H}\middle|\Phi\right\rangle
+ \defint{a}{b}{F(x)}{x}


This example, whose internal form is shown in Listing 1.3, involves
quantum-mechanics notations, which LaTeXML's parser is happily able to
recognize. Additionally, we've introduced a semantic macro \defint to
represent definite integration, which will be transformed to so-called
'Pragmatic' Content MathML form, to enhance the illustration with a
many-to-many correspondence. (The implementation of \defint is not difficult,
but outside the scope of this article.)






Listing 1.3. Internal representation of
$\langle\Psi|\mathcal{H}|\Phi\rangle + \int_a^b F(x)\,dx$

<XMApp>
  <XMTok meaning="plus" role="ADDOP">+</XMTok>
  <XMDual>
    <XMApp>
      <XMTok meaning="quantum-operator-product"/>
      <XMRef idref="m2.5"/>
      <XMRef idref="m2.6"/>
      <XMRef idref="m2.7"/>
    </XMApp>
    <XMWrap>
      <XMTok role="OPEN">⟨</XMTok>
      <XMTok role="ID" xml:id="m2.5">Ψ</XMTok>
      <XMTok role="CLOSE" stretchy="true">|</XMTok>
      <XMTok role="ID" xml:id="m2.6" font="caligraphic">H</XMTok>
      <XMTok role="OPEN" stretchy="true">|</XMTok>
      <XMTok role="ID" xml:id="m2.7">Φ</XMTok>
      <XMTok role="CLOSE">⟩</XMTok>
    </XMWrap>
  </XMDual>
  <XMDual>
    <XMApp>
      <XMTok meaning="hack-definite-integral" role="UNKNOWN"/>
      <XMRef idref="m2.1"/>
      <XMRef idref="m2.2"/>
      <XMRef idref="m2.3"/>
      <XMRef idref="m2.4"/>
    </XMApp>
    <XMApp>
      <XMApp>
        <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
        <XMApp>
          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
          <XMTok mathstyle="display" meaning="integral" role="INTOP">∫</XMTok>
          <XMTok role="ID" xml:id="m2.1" font="italic">a</XMTok>
        </XMApp>
        <XMTok role="ID" xml:id="m2.2" font="italic">b</XMTok>
      </XMApp>
      <XMApp>
        <XMTok meaning="times" role="MULOP">&#x2062;</XMTok>
        <XMDual xml:id="m2.3">
          <XMApp>
            <XMRef idref="m2.3.1"/>
            <XMRef idref="m2.3.2"/>
          </XMApp>
          <XMApp>
            <XMTok role="FUNCTION" xml:id="m2.3.1" font="italic">F</XMTok>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok role="UNKNOWN" xml:id="m2.3.2" font="italic">x</XMTok>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
          </XMApp>
        </XMDual>
        <XMApp>
          <XMTok meaning="differential-d" role="DIFFOP" font="italic">d</XMTok>
          <XMTok role="UNKNOWN" xml:id="m2.4" font="italic">x</XMTok>
        </XMApp>
      </XMApp>
    </XMApp>
  </XMDual>
</XMApp>





3 Algorithm 


The goal is to associate each of the generated target pMML (or cMML) nodes
with some node(s) in the generated cMML (or pMML, respectively). We do this
by ascribing to each generated node a source XMath node, which is not
necessarily the current node, i.e. the one that directly generated the target
node. Once this is done, the cross-referencing is easily established: the
xref of a pMML (cMML) node is the cMML (pMML, respectively) node that was
ascribed to the same source XMath node; if multiple nodes were ascribed to
that source node, the first target node, in document order, is the sensible
choice.
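As a minimal sketch of this final step, assume each branch has already been reduced to a list of (target ID, source ID) pairs in document order. The function name cross_reference and the source key "m1.6" for the application of F are illustrative assumptions; the target IDs are those of the F(a, b) fragment of Listing 1.2.

```python
def cross_reference(pmml, cmml):
    """Map each target ID to the first target of the other format that
    was ascribed to the same source node (hypothetical helper)."""

    def first_by_source(pairs):
        # Keep only the first target, in document order, per source.
        first = {}
        for target, source in pairs:
            first.setdefault(source, target)
        return first

    first_p = first_by_source(pmml)
    first_c = first_by_source(cmml)
    xref = {}
    for target, source in pmml:
        if source in first_c:
            xref[target] = first_c[source]
    for target, source in cmml:
        if source in first_p:
            xref[target] = first_p[source]
    return xref


# (target id, source id) pairs in document order.
pmml = [
    ("m1.6d", "m1.6"), ("m1.1", "m1.1"), ("m1.6e", "m1.6"),
    ("m1.6c", "m1.6"), ("m1.6", "m1.6"), ("m1.2", "m1.2"),
    ("m1.6a", "m1.6"), ("m1.3", "m1.3"), ("m1.6b", "m1.6"),
]
cmml = [
    ("m1.6.cmml", "m1.6"), ("m1.1.cmml", "m1.1"),
    ("m1.2.cmml", "m1.2"), ("m1.3.cmml", "m1.3"),
]
xref = cross_reference(pmml, cmml)
```

Every presentation node ascribed to the application points at the single m:apply, while the m:apply points back at the outermost m:mrow, the first of its presentation counterparts in document order.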

A key to deciding which XMath node to ascribe as the source is whether the 
node is visible to either or both branches. The common, simple, case is an XMath 
node, visible to both branches, that generates a token node in the target; in that 
case the current node is used as the source. Node visibility can be determined 
by an algorithm such as the marking part of mark-and-sweep garbage collection. 
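One possible way to compute visibility, sketched here under the assumption that a simple recursive mark phase suffices, is to mark once per branch and compare the two sets. The function name mark is hypothetical, and the input is the F(a, b) fragment of Listing 1.1 with presentational attributes removed.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

XMDUAL = """
<XMDual>
  <XMApp>
    <XMRef idref="m1.1"/>
    <XMRef idref="m1.2"/>
    <XMRef idref="m1.3"/>
  </XMApp>
  <XMApp>
    <XMTok role="FUNCTION" xml:id="m1.1">F</XMTok>
    <XMWrap>
      <XMTok role="OPEN">(</XMTok>
      <XMTok role="ID" xml:id="m1.2">a</XMTok>
      <XMTok role="PUNCT">,</XMTok>
      <XMTok role="ID" xml:id="m1.3">b</XMTok>
      <XMTok role="CLOSE">)</XMTok>
    </XMWrap>
  </XMApp>
</XMDual>
"""


def mark(node, branch, ids, marked):
    """Mark phase: record every node reachable from node in one branch."""
    if node.tag == "XMRef":
        # A reference makes its target visible, not the XMRef itself.
        mark(ids[node.get("idref")], branch, ids, marked)
        return
    marked.add(id(node))
    if node.tag == "XMDual":
        content, presentation = list(node)
        chosen = content if branch == "content" else presentation
        mark(chosen, branch, ids, marked)
    else:
        for child in node:
            mark(child, branch, ids, marked)


root = ET.fromstring(XMDUAL)
ids = {n.get(XML_ID): n for n in root.iter() if n.get(XML_ID) is not None}
in_content, in_presentation = set(), set()
mark(root, "content", ids, in_content)
mark(root, "presentation", ids, in_presentation)

tokens = list(root.iter("XMTok"))
both = [t.text for t in tokens
        if id(t) in in_content and id(t) in in_presentation]
presentation_only = [t.text for t in tokens
                     if id(t) in in_presentation and id(t) not in in_content]
```

Tokens in both fall under the common, simple case; those in presentation_only are the ones that need the container or operator treatment discussed next.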

However, MathML elements which are containers generally do not correspond
to symbols, and ought to be associated with the nearest application (think
m:apply or m:mrow)². In this case, the source should be the nearest ancestor
XMDual of the current XMath node, which we'll call the current container.

Similar reasoning applies in the special case when a token symbol
(non-container) is generated from an XMath token which is not visible to the
opposite branch; it may simply be notational icing of some transfix, or it may
be the only visible manifestation of what we'll call the current operator.
The current operator is the top-most operator being applied within the current
container. In the example, the angle brackets and vertical bars are the only
visible manifestation of the quantum-operator-product operator.

In summary, the source node for a given target is

  if target is a container
    if current container exists
      current container
    else
      current
  else if current is visible in both branches
    current
  else if current container exists
    if current operator is hidden from presentation
      current operator
    else
      current container
  else
    current
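The decision procedure above translates almost line for line into code. The following is a sketch only: the Context record, its field names and the string labels in the examples are assumptions introduced for illustration and are not part of LaTeXML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical record of what is known when a target node is generated."""
    current: str                  # XMath node that directly generated the target
    container: Optional[str]      # nearest ancestor XMDual, if any
    operator: Optional[str]       # top-most operator applied within the container
    target_is_container: bool     # e.g. the target is an m:mrow or m:apply
    visible_in_both: bool         # current is visible to both branches
    operator_hidden: bool         # operator has no presentation of its own


def source_node(ctx: Context) -> str:
    """Choose the source XMath node to ascribe to a generated target."""
    if ctx.target_is_container:
        return ctx.container if ctx.container is not None else ctx.current
    if ctx.visible_in_both:
        return ctx.current
    if ctx.container is not None:
        if ctx.operator_hidden and ctx.operator is not None:
            return ctx.operator
        return ctx.container
    return ctx.current


# An angle bracket of the quantum-mechanics example: the operator is hidden.
bracket = Context("open-angle", "qop-dual", "qop-token", False, False, True)
# A parenthesis of F(x): the operator F is itself visible.
paren = Context("open-paren", "fx-dual", "F-token", False, False, False)
# The m:mrow generated for the top-level sum: no enclosing XMDual.
top_row = Context("sum-app", None, None, True, False, False)
# The token a, shared between both branches through XMRef.
shared = Context("a-token", "fx-dual", "F-token", False, True, False)

chosen = [source_node(c) for c in (bracket, paren, top_row, shared)]
```

With these inputs the bracket is ascribed to the hidden operator token and the parenthesis to its enclosing XMDual, while the top-level row and the shared token are ascribed to their own current nodes.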

² Exceptions are m:msqrt and m:menclose, which tend to represent the
application of an operation and yet are also the only visible manifestation of
the operator! However, we also note that a common use of cross-linking in HTML
is to turn the links into href links; but HTML does not allow nested links!



Applying this method to the example from Listing 1.3 yields Listing 1.4, where
we can see, for example, that the angle brackets and vertical bars are
associated with the quantum-operator-product operator, while the various
m:bvar, m:lowlimit, etc., are properly associated with the integral, not the
integral operator.


4 Outlook 

The Digital Library of Mathematical Functions³ had, from the outset, linkage
from (most) symbols to their definitions. However, this new approach to the
problem provides a much cleaner implementation, and has allowed the mechanism
to be extended to less textual operators such as binomials, floor, 3-j symbols,
etc.

Parallel markup must also be adapted to larger structures such as eqnarray
and AMS alignments with intertext, which contain multiple formulae and/or
document-level text markup. While the fundamental issue is the same
(separating presentation and content forms), this seems to demand a distributed
markup that separates the presentation and content forms into distinct math
containers. LaTeXML currently has an ad hoc, and not entirely satisfactory,
solution for this, but we will experiment with adapting the methods described
here. However, it remains to be seen whether cross-referencing across separate
math containers can be made useful.

And, now that generating Content MathML is more fun, we must continue
work towards generating good Content MathML. Ongoing work will attempt to
establish appropriate OpenMath Content Dictionaries, probably in a
FlexiFormal sense[?], improve the math grammar, and explore semantic analysis.


³ http://dlmf.nist.gov/
Listing 1.4. MathML representation of
$\langle\Psi|\mathcal{H}|\Phi\rangle + \int_a^b F(x)\,dx$

<math display="block" alttext="..." class="ltx_Math" id="m2">
  <semantics id="m2a">
    <mrow xref="m2.13.cmml" id="m2.13">
      <mrow xref="m2.9.cmml" id="m2.9">
        <mo xref="m2.8.cmml" id="m2.8">&LeftAngleBracket;</mo>
        <mi mathvariant="normal" xref="m2.5.cmml" id="m2.5">&Psi;</mi>
        <mo xref="m2.8.cmml" id="m2.8a" stretchy="true" fence="true">|</mo>
        <mi xref="m2.6.cmml" id="m2.6" class="ltx_font_mathcaligraphic">&HilbertSpace;</mi>
        <mo xref="m2.8.cmml" id="m2.8b" stretchy="true" fence="true">|</mo>
        <mi mathvariant="normal" xref="m2.7.cmml" id="m2.7">&Phi;</mi>
        <mo xref="m2.8.cmml" id="m2.8c">&RightAngleBracket;</mo>
      </mrow>
      <mo xref="m2.10.cmml" id="m2.10">+</mo>
      <mrow xref="m2.12.cmml" id="m2.12c">
        <msubsup xref="m2.12.cmml" id="m2.12">
          <mo xref="m2.11.cmml" id="m2.11" symmetric="true" largeop="true">&int;</mo>
          <mi xref="m2.1.cmml" id="m2.1">a</mi>
          <mi xref="m2.2.cmml" id="m2.2">b</mi>
        </msubsup>
        <mrow xref="m2.12.cmml" id="m2.12b">
          <mrow xref="m2.3.cmml" id="m2.3c">
            <mi xref="m2.3.1.cmml" id="m2.3.1">F</mi>
            <mo xref="m2.3.cmml" id="m2.3d">&ApplyFunction;</mo>
            <mrow xref="m2.3.cmml" id="m2.3b">
              <mo xref="m2.3.cmml" id="m2.3" stretchy="false">(</mo>
              <mi xref="m2.3.2.cmml" id="m2.3.2">x</mi>
              <mo xref="m2.3.cmml" id="m2.3a" stretchy="false">)</mo>
            </mrow>
          </mrow>
          <mo xref="m2.11.cmml" id="m2.11a">&InvisibleTimes;</mo>
          <mrow xref="m2.12.cmml" id="m2.12a">
            <mo xref="m2.11.cmml" id="m2.11b">d</mo>
            <mi xref="m2.4.cmml" id="m2.4">x</mi>
          </mrow>
        </mrow>
      </mrow>
    </mrow>
    <annotation-xml id="m2b" encoding="MathML-Content">
      <apply xref="m2.13" id="m2.13.cmml">
        <plus xref="m2.10" id="m2.10.cmml"/>
        <apply xref="m2.9" id="m2.9.cmml">
          <csymbol xref="m2.8" id="m2.8.cmml" cd="latexml">quantum-operator-product</csymbol>
          <ci xref="m2.5" id="m2.5.cmml">normal-&Psi;</ci>
          <ci xref="m2.6" id="m2.6.cmml">&HilbertSpace;</ci>
          <ci xref="m2.7" id="m2.7.cmml">normal-&Phi;</ci>
        </apply>
        <apply xref="m2.12c" id="m2.12.cmml">
          <int xref="m2.11" id="m2.11.cmml"/>
          <bvar xref="m2.12c" id="m2.12a.cmml">
            <ci xref="m2.4" id="m2.4.cmml">x</ci>
          </bvar>
          <lowlimit xref="m2.12c" id="m2.12b.cmml">
            <ci xref="m2.1" id="m2.1.cmml">a</ci>
          </lowlimit>
          <lowupper xref="m2.12c" id="m2.12c.cmml">
            <ci xref="m2.2" id="m2.2.cmml">b</ci>
          </lowupper>
          <apply xref="m2.3c" id="m2.3.cmml">
            <ci xref="m2.3.1" id="m2.3.1.cmml">F</ci>
            <ci xref="m2.3.2" id="m2.3.2.cmml">x</ci>
          </apply>
        </apply>
      </apply>
    </annotation-xml>
    <annotation id="m2c" encoding="application/x-tex">...</annotation>
  </semantics>
</math>



